AML2019
Challenge 1


Group 22

  • NGUYEN Thanh-Long
  • DANG Ngoc-Vien

Frame the Problem

The model's output: a predicted house sale price

  • It is a supervised learning task.
  • It is a regression problem.
  • The data is small enough to fit in memory so batch learning should do fine.
  • A typical performance measure for regression problems is the Root Mean Square Error (RMSE), which we adopt here.
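As a quick sanity check on the metric, RMSE can be computed directly with NumPy or via scikit-learn's `mean_squared_error`; the snippet below is a minimal sketch on toy values, not part of the challenge pipeline:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sk = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # 10000.0
```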

Setup

In [ ]:
!pip3 install --user xgboost
!pip3 install --user missingno
In [99]:
# Import Libraries
import pandas as pd
import numpy as np
import math
import scipy.stats as ss
from scipy.stats import uniform
from scipy.stats import randint as sp_randint

from xgboost import XGBRegressor
from scipy import stats


from math import ceil
from math import sqrt
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge

from sklearn.feature_selection import RFECV
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer

from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import make_scorer 

import warnings
warnings.filterwarnings('ignore')
import missingno as msno # Missingno package for visualizing missing data
from collections import Counter

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
%matplotlib inline
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

Get the data

In [100]:
urltrain = "https://raw.githubusercontent.com/DistributedSystemsGroup/Algorithmic-Machine-Learning/master/Challenges/House_Pricing/challenge_data/train.csv"
urltest = "https://raw.githubusercontent.com/DistributedSystemsGroup/Algorithmic-Machine-Learning/master/Challenges/House_Pricing/challenge_data/test.csv"
dftrain = pd.read_csv(urltrain)
dftest = pd.read_csv(urltest)
dftrain.shape, dftest.shape
Out[100]:
((1200, 81), (260, 80))

Take a Quick Look at the Data Structure

In [101]:
dftrain.head()
Out[101]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

Each row represents 1 house. There are 81 attributes: Id, MSSubClass, MSZoning, ..., and SalePrice.

In [102]:
dftrain.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 81 columns):
Id               1200 non-null int64
MSSubClass       1200 non-null int64
MSZoning         1200 non-null object
LotFrontage      990 non-null float64
LotArea          1200 non-null int64
Street           1200 non-null object
Alley            75 non-null object
LotShape         1200 non-null object
LandContour      1200 non-null object
Utilities        1200 non-null object
LotConfig        1200 non-null object
LandSlope        1200 non-null object
Neighborhood     1200 non-null object
Condition1       1200 non-null object
Condition2       1200 non-null object
BldgType         1200 non-null object
HouseStyle       1200 non-null object
OverallQual      1200 non-null int64
OverallCond      1200 non-null int64
YearBuilt        1200 non-null int64
YearRemodAdd     1200 non-null int64
RoofStyle        1200 non-null object
RoofMatl         1200 non-null object
Exterior1st      1200 non-null object
Exterior2nd      1200 non-null object
MasVnrType       1194 non-null object
MasVnrArea       1194 non-null float64
ExterQual        1200 non-null object
ExterCond        1200 non-null object
Foundation       1200 non-null object
BsmtQual         1168 non-null object
BsmtCond         1168 non-null object
BsmtExposure     1167 non-null object
BsmtFinType1     1168 non-null object
BsmtFinSF1       1200 non-null int64
BsmtFinType2     1167 non-null object
BsmtFinSF2       1200 non-null int64
BsmtUnfSF        1200 non-null int64
TotalBsmtSF      1200 non-null int64
Heating          1200 non-null object
HeatingQC        1200 non-null object
CentralAir       1200 non-null object
Electrical       1200 non-null object
1stFlrSF         1200 non-null int64
2ndFlrSF         1200 non-null int64
LowQualFinSF     1200 non-null int64
GrLivArea        1200 non-null int64
BsmtFullBath     1200 non-null int64
BsmtHalfBath     1200 non-null int64
FullBath         1200 non-null int64
HalfBath         1200 non-null int64
BedroomAbvGr     1200 non-null int64
KitchenAbvGr     1200 non-null int64
KitchenQual      1200 non-null object
TotRmsAbvGrd     1200 non-null int64
Functional       1200 non-null object
Fireplaces       1200 non-null int64
FireplaceQu      636 non-null object
GarageType       1133 non-null object
GarageYrBlt      1133 non-null float64
GarageFinish     1133 non-null object
GarageCars       1200 non-null int64
GarageArea       1200 non-null int64
GarageQual       1133 non-null object
GarageCond       1133 non-null object
PavedDrive       1200 non-null object
WoodDeckSF       1200 non-null int64
OpenPorchSF      1200 non-null int64
EnclosedPorch    1200 non-null int64
3SsnPorch        1200 non-null int64
ScreenPorch      1200 non-null int64
PoolArea         1200 non-null int64
PoolQC           4 non-null object
Fence            227 non-null object
MiscFeature      47 non-null object
MiscVal          1200 non-null int64
MoSold           1200 non-null int64
YrSold           1200 non-null int64
SaleType         1200 non-null object
SaleCondition    1200 non-null object
SalePrice        1200 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 759.5+ KB
As the result above shows, there are 1,200 instances in the training set, which means it is fairly small by Machine Learning standards. Notice that the LotFrontage attribute has only 990 non-null values, meaning that 210 houses are missing this feature. Several other attributes also have missing values: Alley, MasVnrType, MasVnrArea, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, and MiscFeature. We will need to take care of this later.
In [103]:
dftrain.describe()
Out[103]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1200.000000 1200.000000 990.000000 1200.000000 1200.000000 1200.000000 1200.000000 1200.000000 1194.000000 1200.000000 ... 1200.000000 1200.000000 1200.000000 1200.000000 1200.000000 1200.000000 1200.000000 1200.000000 1200.000000 1200.000000
mean 600.500000 57.075000 70.086869 10559.411667 6.105000 5.568333 1971.350833 1984.987500 103.962312 444.886667 ... 95.136667 46.016667 22.178333 3.653333 14.980833 1.909167 40.453333 6.311667 2007.810833 181414.628333
std 346.554469 42.682012 23.702029 10619.135549 1.383439 1.120138 30.048408 20.527221 183.534953 439.987844 ... 124.034129 65.677629 61.507323 29.991099 54.768057 33.148327 482.323444 2.673104 1.319027 81070.908544
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1875.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 300.750000 20.000000 59.000000 7560.000000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129900.000000
50% 600.500000 50.000000 70.000000 9434.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 385.500000 ... 0.000000 24.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163700.000000
75% 900.250000 70.000000 80.000000 11616.000000 7.000000 6.000000 2000.000000 2004.000000 166.750000 712.250000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1200.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 2260.000000 ... 857.000000 523.000000 552.000000 508.000000 410.000000 648.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

As mentioned, 38 of the 81 attributes are numerical; above is the summary of each numerical attribute.

Discover - Clean - Visualize the Data to Gain Insights

1. SalePrice - The Target Variable

In [104]:
# log transformation
dftrain['LogSalePrice'] = np.log(dftrain['SalePrice'])
dftrain[['SalePrice','LogSalePrice']].describe()
Out[104]:
SalePrice LogSalePrice
count 1200.000000 1200.000000
mean 181414.628333 12.024861
std 81070.908544 0.403556
min 34900.000000 10.460242
25% 129900.000000 11.774520
50% 163700.000000 12.005790
75% 214000.000000 12.273731
max 755000.000000 13.534473
The mean SalePrice is roughly \$181,415; 25% of the houses sold for less than \$129,900, 50% for less than \$163,700, and 75% for less than \$214,000.
In [105]:
plt.subplots(figsize =(20, 6))
# histogram
plt.subplot(1, 3, 1)
sns.distplot(dftrain['SalePrice'], color='navy')
# boxplot
plt.subplot(1, 3, 2)
sns.boxplot(data=dftrain['SalePrice'], color='navy').set_title('SalePrice')
# histogram of log
plt.subplot(1, 3, 3)
sns.distplot(dftrain['LogSalePrice'], color='navy')
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f965a367630>
- SalePrice is heavily right-skewed because some houses are extremely expensive.
- The boxplot also clearly illustrates those outliers.
- We applied the log transformation to SalePrice in order to normalize its distribution.
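Because the model will be trained on the log of the target, predictions have to be mapped back to dollars with the inverse transform; a minimal sketch using `np.log`/`np.exp`, mirroring the transformation above:

```python
import numpy as np
import pandas as pd

prices = pd.Series([34900.0, 163700.0, 755000.0], name='SalePrice')
log_prices = np.log(prices)      # what the model is trained on
recovered = np.exp(log_prices)   # applied to predictions before reporting
print(np.allclose(prices, recovered))  # True
```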

2. Dealing With Missing Data

In [106]:
# Merge the train and test data
dftotal = pd.concat((dftrain, dftest)).reset_index(drop=True)
dftotal.drop(['SalePrice','LogSalePrice'], axis=1, inplace=True)
# Checking as 1200 + 260 = 1460
print(format(dftrain.shape))
print(format(dftest.shape))
print(format(dftotal.shape))
(1200, 82)
(260, 80)
(1460, 80)
In [107]:
# Calculate the missing values per attribute
# The results are the number of missing values and the percentage of missing values for each attribute with missing data
total = dftotal.isnull().sum().sort_values(ascending=False)
percent = (dftotal.isnull().sum()/dftotal.isnull().count()*100).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing = missing[missing['Total']>0]
print("The table below lists the attributes with missing values and the percentage of missing values for each:")
print()
print(missing)
print()
print("The number of attributes with missing values is", len(missing))
The table below lists the attributes with missing values and the percentage of missing values for each:

              Total    Percent
PoolQC         1453  99.520548
MiscFeature    1406  96.301370
Alley          1369  93.767123
Fence          1179  80.753425
FireplaceQu     690  47.260274
LotFrontage     259  17.739726
GarageFinish     81   5.547945
GarageCond       81   5.547945
GarageType       81   5.547945
GarageQual       81   5.547945
GarageYrBlt      81   5.547945
BsmtExposure     38   2.602740
BsmtFinType2     38   2.602740
BsmtCond         37   2.534247
BsmtFinType1     37   2.534247
BsmtQual         37   2.534247
MasVnrArea        8   0.547945
MasVnrType        8   0.547945
Electrical        1   0.068493

The number of attributes with missing values is 19
In [108]:
# Illustrating the missing values of each attribute having missing values
msno.matrix(dftotal[missing.index])
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f965b5b61d0>

Notes

- Not all the NaN values are truly missing values.
- Some attributes use NA as a legitimate value. For example, PoolQC (pool quality): Ex = Excellent, Gd = Good, TA = Average/Typical, Fa = Fair, NA = No Pool.
- There are 15 attributes where a NaN value does not mean "missing"; it means the house simply does not have that feature.
In [109]:
# Therefore, replace those NaN with "None"
none_to_fill_col = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageFinish', 
                    'GarageCond', 'GarageType', 'GarageQual', 'BsmtExposure', 'BsmtFinType2', 
                    'BsmtCond', 'BsmtFinType1', 'BsmtQual', 'MasVnrType']
for col in none_to_fill_col:
    dftotal[col] = dftotal[col].fillna("None")
    dftrain[col] = dftrain[col].fillna("None")
    dftest[col] = dftest[col].fillna("None")

# Re-check the missing attributes
total = dftotal.isnull().sum().sort_values(ascending=False)
percent = (dftotal.isnull().sum()/dftotal.isnull().count()*100).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing = missing[missing['Total']>0]
print("The table below lists the attributes with missing values and the percentage of missing values for each:")
print()
print(missing)
print()
print("The number of attributes with missing values is", len(missing))
The table below lists the attributes with missing values and the percentage of missing values for each:

             Total    Percent
LotFrontage    259  17.739726
GarageYrBlt     81   5.547945
MasVnrArea       8   0.547945
Electrical       1   0.068493

The number of attributes with missing values is 4
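The missing-value report above is computed twice with identical code; it could be factored into a small helper (the name `missing_report` is ours, not from the notebook). A sketch with a toy frame:

```python
import numpy as np
import pandas as pd

def missing_report(df):
    # Count and percentage of missing values per column, restricted to
    # columns that actually have missing values
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().mean() * 100).sort_values(ascending=False)
    report = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return report[report['Total'] > 0]

toy = pd.DataFrame({'a': [1, np.nan, 3, 4], 'b': [1, 2, 3, 4]})
rep = missing_report(toy)
print(rep)  # only column 'a': Total=1, Percent=25.0
```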

Data Handling for missing features listed above

LotFrontage case

We fill these missing values with the average LotFrontage of the other houses in the same Neighborhood.

In [110]:
# Print the 10 first missing LotFrontage cases
print(dftotal.loc[dftotal['LotFrontage'].isnull()][['LotFrontage','Neighborhood']][:10],'\n')
    LotFrontage Neighborhood
7           NaN       NWAmes
12          NaN       Sawyer
14          NaN        NAmes
16          NaN        NAmes
24          NaN       Sawyer
31          NaN       Sawyer
42          NaN      SawyerW
43          NaN      CollgCr
50          NaN      Gilbert
64          NaN      CollgCr 

In [111]:
# Calculate the average LotFrontage in each Neighborhood and display it
dftemp = dftotal.groupby('Neighborhood')['LotFrontage'].mean()
display(dftemp)
Neighborhood
Blmngtn    47.142857
Blueste    24.000000
BrDale     21.562500
BrkSide    57.509804
ClearCr    83.461538
CollgCr    71.682540
Crawfor    71.804878
Edwards    68.217391
Gilbert    79.877551
IDOTRR     62.500000
MeadowV    27.800000
Mitchel    70.083333
NAmes      76.462366
NPkVill    32.285714
NWAmes     81.288889
NoRidge    91.878788
NridgHt    81.881579
OldTown    62.788991
SWISU      58.913043
Sawyer     74.437500
SawyerW    71.500000
Somerst    64.666667
StoneBr    62.700000
Timber     80.133333
Veenker    59.714286
Name: LotFrontage, dtype: float64
In [112]:
# Map the mean value of each Neighborhood onto the NaN values in LotFrontage.
dftotal.loc[dftotal['LotFrontage'].isnull(),'LotFrontage'] = dftotal['Neighborhood'].map(dftemp)
dftrain.loc[dftrain['LotFrontage'].isnull(),'LotFrontage'] = dftrain['Neighborhood'].map(dftemp)
dftest.loc[dftest['LotFrontage'].isnull(),'LotFrontage'] = dftest['Neighborhood'].map(dftemp)
# Recheck the mapping values. 
dftotal[['LotFrontage','Neighborhood']].loc[[7,12,43,50]]
Out[112]:
LotFrontage Neighborhood
7 81.288889 NWAmes
12 74.437500 Sawyer
43 71.682540 CollgCr
50 79.877551 Gilbert
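The same neighborhood-wise imputation can be written more compactly with `groupby(...).transform`; a sketch on a toy frame (not the challenge data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'NAmes', 'Sawyer', 'Sawyer'],
    'LotFrontage':  [60.0,    np.nan,  70.0,     80.0],
})
# Fill each NaN with the mean LotFrontage of its own neighborhood
toy['LotFrontage'] = (toy.groupby('Neighborhood')['LotFrontage']
                         .transform(lambda s: s.fillna(s.mean())))
print(toy['LotFrontage'].tolist())  # [60.0, 60.0, 70.0, 80.0]
```

Note that computing the per-neighborhood means on the combined train+test frame, as done above, leaks a little test information into training; computing them on the training set alone would avoid that.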
GarageYrBlt case
In more than 75% of cases, the garage was built in the same year as the house. So we impute the missing GarageYrBlt values with the YearBuilt values.
In [113]:
# Print out the comparison between the YearBuilt and GarageYrBlt attributes
print("Display the first 5 rows with only the YearBuilt and GarageYrBlt attributes, where GarageYrBlt is missing")
display(dftotal.loc[dftotal['GarageYrBlt'].isnull()][['GarageYrBlt','YearBuilt']][:5])
print("Summary of the difference between the YearBuilt and GarageYrBlt attributes")
print((dftotal['GarageYrBlt']-dftotal['YearBuilt']).describe(), '\n')
Display the first 5 rows with only the YearBuilt and GarageYrBlt attributes, where GarageYrBlt is missing
GarageYrBlt YearBuilt
39 NaN 1955
48 NaN 1920
78 NaN 1968
88 NaN 1915
89 NaN 1994
Summary of the difference between the YearBuilt and GarageYrBlt attributes
count    1379.000000
mean        5.547498
std        16.580490
min       -10.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       123.000000
dtype: float64 

As the result above shows, in more than 75% of the houses with a garage, the garage was built in the same year as the house.
In [114]:
# Replace the missing GarageYrBlt values with the YearBuilt values
dftotal.loc[dftotal['GarageYrBlt'].isnull(),'GarageYrBlt'] = dftotal['YearBuilt']
dftrain.loc[dftrain['GarageYrBlt'].isnull(),'GarageYrBlt'] = dftrain['YearBuilt']
dftest.loc[dftest['GarageYrBlt'].isnull(),'GarageYrBlt'] = dftest['YearBuilt']

# Comparison between values before and after imputation
print("Display these 5 rows from above after filling the missing GarageYrBlt values")
display(dftotal[['GarageYrBlt','YearBuilt']].loc[[39,48,78,88,89]])
print("Summary of the difference between the YearBuilt and GarageYrBlt attributes after filling the missing GarageYrBlt values")
print((dftotal['GarageYrBlt']-dftotal['YearBuilt']).describe())
Display these 5 rows from above after filling the missing GarageYrBlt values
GarageYrBlt YearBuilt
39 1955.0 1955
48 1920.0 1920
78 1968.0 1968
88 1915.0 1915
89 1994.0 1994
Summary of the difference between the YearBuilt and GarageYrBlt attributes after filling the missing GarageYrBlt values
count    1460.000000
mean        5.239726
std        16.163661
min       -10.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       123.000000
dtype: float64
MasVnrArea case
All the rows where 'MasVnrArea' is NaN also have 'MasVnrType' equal to None, meaning there is no masonry veneer, so MasVnrArea should equal 0.
In [115]:
# Check the rows where 'MasVnrArea' is NaN, together with 'MasVnrType'
print("Display the rows with only the MasVnrType and MasVnrArea attributes, where MasVnrArea is missing")
dftotal.loc[dftotal['MasVnrArea'].isnull()][['MasVnrType','MasVnrArea']]
Display the rows with only the MasVnrType and MasVnrArea attributes, where MasVnrArea is missing
Out[115]:
MasVnrType MasVnrArea
234 None NaN
529 None NaN
650 None NaN
936 None NaN
973 None NaN
977 None NaN
1243 None NaN
1278 None NaN
In [116]:
# Replace MasVnrArea NaN values with 0.
dftotal['MasVnrArea'] = dftotal['MasVnrArea'].fillna(0)
dftrain['MasVnrArea'] = dftrain['MasVnrArea'].fillna(0)
dftest['MasVnrArea'] = dftest['MasVnrArea'].fillna(0)
print("Display these rows from above after filling the missing MasVnrArea values")
dftotal[['MasVnrType','MasVnrArea']].loc[[234,529,650,936]]
Display these rows from above after filling the missing MasVnrArea values
Out[116]:
MasVnrType MasVnrArea
234 None 0.0
529 None 0.0
650 None 0.0
936 None 0.0
Electrical case
Only one house is missing the Electrical attribute, and the SBrkr value dominates, so we replace the missing value with SBrkr.
In [117]:
print("Display all rows with a missing Electrical value")
display(dftotal.loc[dftotal['Electrical'].isnull()])
print("Display the categories of the Electrical attribute and how many houses belong to each")
print(dftotal.groupby('Electrical')['Id'].count())
Display all rows with a missing Electrical value
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
1379 754 640 0 None 3 1Fam TA No 0 0 ... WD 0 Pave 7 384 AllPub 100 2006 2007 2008

1 rows × 80 columns

Display the categories of the Electrical attribute and how many houses belong to each
Electrical
FuseA      94
FuseF      27
FuseP       3
Mix         1
SBrkr    1334
Name: Id, dtype: int64
In [118]:
# Replace the Electrical value with SBrkr and re-check
dftotal['Electrical'] = dftotal['Electrical'].fillna('SBrkr')
dftest['Electrical'] = dftest['Electrical'].fillna('SBrkr')
(dftotal.loc[dftotal['Electrical'].isnull()])
Out[118]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold

0 rows × 80 columns
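Hard-coding `'SBrkr'` works here because we inspected the counts; an equivalent, more general pattern fills a categorical column with its mode (a sketch on a toy series, not the challenge data):

```python
import numpy as np
import pandas as pd

s = pd.Series(['SBrkr', 'SBrkr', 'FuseA', np.nan])
filled = s.fillna(s.mode()[0])  # mode() ignores NaN and returns the most frequent value(s)
print(filled.tolist())  # ['SBrkr', 'SBrkr', 'FuseA', 'SBrkr']
```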

  • Final check for missing values
In [119]:
# Check if there are any missing values left
all_data_na = (dftotal.isnull().sum() / len(dftotal)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
print(missing_data.head())

all_data_na = (dftrain.isnull().sum() / len(dftotal)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
print(missing_data.head())

all_data_na = (dftest.isnull().sum() / len(dftotal)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
print(missing_data.head())
Empty DataFrame
Columns: [Missing Ratio]
Index: []
Empty DataFrame
Columns: [Missing Ratio]
Index: []
Empty DataFrame
Columns: [Missing Ratio]
Index: []

3. Attribute Combination

We combine some attributes that are more meaningful when put together:
  • TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF + GrLivArea
  • TotalBath = FullBath + BsmtFullBath + 0.5*(HalfBath + BsmtHalfBath)
  • AgeAtSell = YrSold - YearBuilt
  • RemodAgeAtSell = YrSold - YearRemodAdd
  • TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF - This means we can drop following attributes: BsmtFinSF1, BsmtFinSF2, BsmtUnfSF
  • We drop both MoSold and YrSold
In [120]:
#Prove that TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF
#dftotal['BsmtFinSF1'] + dftotal['BsmtFinSF2'] + dftotal['BsmtUnfSF']
print((dftotal['BsmtFinSF1'] + dftotal['BsmtFinSF2'] + dftotal['BsmtUnfSF']).describe(),'\n')
print(dftotal['TotalBsmtSF'].describe())
dftotal.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1, inplace=True)
dftest.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1, inplace=True)  
count    1460.000000
mean     1057.429452
std       438.705324
min         0.000000
25%       795.750000
50%       991.500000
75%      1298.250000
max      6110.000000
dtype: float64 

count    1460.000000
mean     1057.429452
std       438.705324
min         0.000000
25%       795.750000
50%       991.500000
75%      1298.250000
max      6110.000000
Name: TotalBsmtSF, dtype: float64
In [121]:
dftotal['TotalSF'] = dftotal['TotalBsmtSF'] + dftotal['1stFlrSF'] + dftotal['2ndFlrSF'] + dftotal['GrLivArea']
#dftest['TotalSF'] = dftest['TotalBsmtSF'] + dftest['1stFlrSF'] + dftest['2ndFlrSF'] + dftest['GrLivArea']
dftotal.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea'], axis=1, inplace=True)
#dftest.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea'], axis=1, inplace=True)

dftotal['TotalBath'] = dftotal['FullBath'] + dftotal['BsmtFullBath'] + 0.5*(dftotal['HalfBath'] + dftotal['BsmtHalfBath'])
#dftest['TotalBath'] = dftest['FullBath'] + dftest['BsmtFullBath'] + 0.5*(dftest['HalfBath'] + dftest['BsmtHalfBath'])
dftotal.drop(['FullBath', 'BsmtFullBath', 'HalfBath', 'BsmtHalfBath'], axis=1, inplace=True)
#dftest.drop(['FullBath', 'BsmtFullBath', 'HalfBath', 'BsmtHalfBath'], axis=1, inplace=True)

dftotal['AgeAtSell'] = abs(dftotal['YrSold'] -  dftotal['YearBuilt'])
#dftest['AgeAtSell'] = abs(dftest['YrSold'] - dftest['YearBuilt'])
                                                                         
dftotal['RemodAgeAtSell'] = abs(dftotal['YrSold'] -  dftotal['YearRemodAdd'])
#dftest['RemodAgeAtSell'] = abs(dftest['YrSold'] - dftest['YearRemodAdd'])

dftotal.drop(['YrSold', 'YearRemodAdd','MoSold','YearBuilt'], axis=1, inplace=True)
#dftest.drop(['YrSold', 'YearRemodAdd','MoSold','YearBuilt'], axis=1, inplace=True)                                                                                                                     

4. Handling the Numerical and Categorical Data

4.1 Handling with the numerical data

  • Checking Numerical Data
In [122]:
# Check the Categorical and Numerical data
# Checking Numerical Data
# dftotal._get_numeric_data().info()
num_col=(dftotal.select_dtypes(include=['int64','float64']).columns)
num_col = num_col.drop(['Id'])
print("The number of numerical variables:", len(num_col),'\n')
print(num_col)
The number of numerical variables: 25 

Index(['3SsnPorch', 'BedroomAbvGr', 'EnclosedPorch', 'Fireplaces',
       'GarageArea', 'GarageCars', 'GarageYrBlt', 'KitchenAbvGr', 'LotArea',
       'LotFrontage', 'LowQualFinSF', 'MSSubClass', 'MasVnrArea', 'MiscVal',
       'OpenPorchSF', 'OverallCond', 'OverallQual', 'PoolArea', 'ScreenPorch',
       'TotRmsAbvGrd', 'WoodDeckSF', 'TotalSF', 'TotalBath', 'AgeAtSell',
       'RemodAgeAtSell'],
      dtype='object')
In [123]:
dftotal[num_col].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 25 columns):
3SsnPorch         1460 non-null int64
BedroomAbvGr      1460 non-null int64
EnclosedPorch     1460 non-null int64
Fireplaces        1460 non-null int64
GarageArea        1460 non-null int64
GarageCars        1460 non-null int64
GarageYrBlt       1460 non-null float64
KitchenAbvGr      1460 non-null int64
LotArea           1460 non-null int64
LotFrontage       1460 non-null float64
LowQualFinSF      1460 non-null int64
MSSubClass        1460 non-null int64
MasVnrArea        1460 non-null float64
MiscVal           1460 non-null int64
OpenPorchSF       1460 non-null int64
OverallCond       1460 non-null int64
OverallQual       1460 non-null int64
PoolArea          1460 non-null int64
ScreenPorch       1460 non-null int64
TotRmsAbvGrd      1460 non-null int64
WoodDeckSF        1460 non-null int64
TotalSF           1460 non-null int64
TotalBath         1460 non-null float64
AgeAtSell         1460 non-null int64
RemodAgeAtSell    1460 non-null int64
dtypes: float64(4), int64(21)
memory usage: 285.2 KB
In [124]:
# Statistical description of all the numerical attributes, sorted by count
dftotal[num_col].describe().T.sort_values(by='count')
Out[124]:
count mean std min 25% 50% 75% max
3SsnPorch 1460.0 3.409589 29.317331 0.0 0.0 0.000000 0.00 508.0
TotalBath 1460.0 2.210616 0.785399 1.0 2.0 2.000000 2.50 6.0
TotalSF 1460.0 4082.512329 1306.309133 668.0 3168.0 3941.500000 4768.50 17394.0
WoodDeckSF 1460.0 94.244521 125.338794 0.0 0.0 0.000000 168.00 857.0
TotRmsAbvGrd 1460.0 6.517808 1.625393 2.0 5.0 6.000000 7.00 14.0
ScreenPorch 1460.0 15.060959 55.757415 0.0 0.0 0.000000 0.00 480.0
PoolArea 1460.0 2.758904 40.177307 0.0 0.0 0.000000 0.00 738.0
OverallQual 1460.0 6.099315 1.382997 1.0 5.0 6.000000 7.00 10.0
OverallCond 1460.0 5.575342 1.112799 1.0 5.0 5.000000 6.00 9.0
OpenPorchSF 1460.0 46.660274 66.256028 0.0 0.0 25.000000 68.00 547.0
MiscVal 1460.0 43.489041 496.123024 0.0 0.0 0.000000 0.00 15500.0
AgeAtSell 1460.0 36.547945 30.250152 0.0 8.0 35.000000 54.00 136.0
MasVnrArea 1460.0 103.117123 180.731373 0.0 0.0 0.000000 164.25 1600.0
LowQualFinSF 1460.0 5.844521 48.623081 0.0 0.0 0.000000 0.00 572.0
LotFrontage 1460.0 70.725218 22.426978 21.0 60.0 70.083333 80.00 313.0
LotArea 1460.0 10516.828082 9981.264932 1300.0 7553.5 9478.500000 11601.50 215245.0
KitchenAbvGr 1460.0 1.046575 0.220338 0.0 1.0 1.000000 1.00 3.0
GarageYrBlt 1460.0 1976.507534 26.306739 1872.0 1959.0 1978.000000 2001.00 2010.0
GarageCars 1460.0 1.767123 0.747315 0.0 1.0 2.000000 2.00 4.0
GarageArea 1460.0 472.980137 213.804841 0.0 334.5 480.000000 576.00 1418.0
Fireplaces 1460.0 0.613014 0.644666 0.0 0.0 1.000000 1.00 3.0
EnclosedPorch 1460.0 21.954110 61.119149 0.0 0.0 0.000000 0.00 552.0
BedroomAbvGr 1460.0 2.866438 0.815778 0.0 2.0 3.000000 3.00 8.0
MSSubClass 1460.0 56.897260 42.300571 20.0 20.0 50.000000 70.00 190.0
RemodAgeAtSell 1460.0 22.951370 20.639129 0.0 4.0 14.000000 41.00 60.0
  • Scatter plots of the attributes listed above against SalePrice
In [125]:
# scatter plots
tempdata = pd.concat([dftotal[:len(dftrain)], dftrain['SalePrice']], axis=1) #get the train data - data frame 

temp = pd.melt(tempdata, id_vars=['SalePrice'],value_vars=num_col)
grid = sns.FacetGrid(temp, col="variable",  col_wrap=4 , height=3.0, 
                     aspect=1.2,sharex=False, sharey=False)
grid.map(plt.scatter, "value",'SalePrice', s=3,color='navy')
plt.show()
Three of these attributes are actually categorical rather than numerical: 'MSSubClass', 'OverallCond', 'OverallQual'.

  • Drop those categorical columns from the numerical list
In [126]:
# drop those categorical columns from the numerical column list
num_col = num_col.drop(['MSSubClass', 'OverallCond', 'OverallQual'])
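Since these three features are numeric codes rather than quantities, one common follow-up (an assumption on our part, not shown in the notebook) is to cast them to strings so a later one-hot encoding treats them as categories; a toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({'MSSubClass': [20, 60, 60, 190]})
toy['MSSubClass'] = toy['MSSubClass'].astype(str)   # treat codes as labels
dummies = pd.get_dummies(toy, columns=['MSSubClass'])
print(sorted(dummies.columns))
# ['MSSubClass_190', 'MSSubClass_20', 'MSSubClass_60']
```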
  • Visualize the distribution of each numerical feature
In [127]:
# visualize the distribution of each numerical feature
temp = pd.melt(dftotal, value_vars=num_col)
grid = sns.FacetGrid(temp, col="variable",  col_wrap=5 , height=3.0, 
                     aspect=1.0,sharex=False, sharey=False)
grid.map(sns.distplot, "value",color='navy')
plt.show()
Each attribute has a different distribution. For example, AgeAtSell has a right-skewed distribution.
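Right-skewed features are often compressed with `np.log1p` (i.e. log(1+x), which is safe for zero values); a sketch on synthetic right-skewed data, not the challenge features:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.RandomState(0)
x = rng.exponential(scale=30, size=1000)  # strongly right-skewed sample
print(skew(x), skew(np.log1p(x)))  # the log1p version is far less right-skewed
```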
  • Plot the boxplot for each numerical feature
In [128]:
# Plot the boxplot of each attribute compared with the SalePrice
def chunks(l, n):
    return [l[i:i + n] for i in range(0, len(l), n)]

def boxplot(df, cols, ncols):
    for lst in chunks(cols, ncols):
        fig, axes = plt.subplots(nrows=1, ncols=ncols, figsize=(10, 4), dpi=200)
        sns.set(font_scale = 0.7)
        for idx in range(0, len(lst)):
            attr = lst[idx]
            data = pd.concat([df['SalePrice'], df[attr]], axis=1)
            sns.set_palette('Paired',30)
            g = sns.boxplot(data=df[attr], ax=axes[idx],fliersize=0.5, linewidth=0.5, color='navy').set_title(attr)
#             for item in g.get_xticklabels():
#                 item.set_rotation(75)
        plt.tight_layout()

boxplot(tempdata, num_col, 3)
Almost all numerical features have very different scales and contain some very large outliers; therefore, we rescale them later.
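RobustScaler (already imported above) centers each feature on its median and scales by the interquartile range, so the large outliers visible in the boxplots have limited influence; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier
scaled = RobustScaler().fit_transform(X)
# The median (3.0) maps to 0 and scaling uses the IQR (4.0 - 2.0 = 2.0),
# so the outlier does not distort the scale of the ordinary values
print(scaled.ravel())  # [-1.  -0.5  0.   0.5 498.5]
```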
  • Plot the correlation matrix for numerical attributes
In [129]:
## CORRELATION MATRIX FOR NUMERICAL ATTRIBUTES

# BUILD THE CORRELATION MATRIX BETWEEN NUMERIC (int64 and float64) ATTRIBUTES
correlationMatrix = dftotal[num_col].corr(method='spearman')

# PLOT THE CORRELATION MATRIX
plt.figure(figsize=(25,25))
plt.title("Correlation matrix between numerical attributes", weight="semibold", fontsize=20)

# Build the Color Correlation Matrix
mask = np.zeros_like(correlationMatrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.set(font_scale = 1.3)
g = sns.heatmap(correlationMatrix, cmap='Blues', fmt = '.2f', square = True, 
                mask=mask, annot=True, annot_kws={"size":14}, linewidths=1.0)

for text in g.texts:
    t = float(text.get_text())
    if ((t) < 0.65):
        text.set_text('')
    else:
        text.set_text(round(t, 4))

# Build the Values Correlation Matrix
mask[np.triu_indices_from(mask)] = False
mask[np.tril_indices_from(mask)] = True
g = sns.heatmap(correlationMatrix, cmap=ListedColormap(['white']), square = True, fmt = '.1f', 
                linewidths=1.0, mask=mask, annot=True, annot_kws={"size":12}, cbar=False);
g.set_xticklabels(g.get_xticklabels(), rotation=60, ha="right");
In [130]:
#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)
print("Display first 10 pairs of features with high correlation:")
correlationMatrix.where(np.triu(np.ones(correlationMatrix.shape), k=1)
                        .astype(np.bool)).stack().sort_values(ascending=False)[:10]
Display first 10 pairs of features with high correlation:
Out[130]:
GarageArea    GarageCars        0.853317
TotRmsAbvGrd  TotalSF           0.756027
AgeAtSell     RemodAgeAtSell    0.692686
BedroomAbvGr  TotRmsAbvGrd      0.667822
GarageCars    GarageYrBlt       0.648543
TotalSF       TotalBath         0.628322
GarageArea    GarageYrBlt       0.616982
LotArea       LotFrontage       0.615459
GarageCars    TotalSF           0.566426
GarageArea    TotalSF           0.547319
dtype: float64

As we know, highly correlated features are close to linearly dependent and hence carry almost the same information about the dependent variable. So, when two features are highly correlated, we consider dropping one of them.
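Dropping one feature of each highly correlated pair can be sketched with a small helper like the following (a toy DataFrame; the 0.65 threshold and the column names are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.65):
    """Drop one column from each pair whose absolute Spearman correlation exceeds threshold."""
    corr = df.corr(method='spearman').abs()
    # keep only the upper triangle (without the diagonal) so each pair is counted once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],   # perfectly correlated with 'a'
                    'c': [3, 1, 4, 2, 5]})   # only weakly correlated
reduced, dropped = drop_correlated(toy)
print(dropped)  # ['b']
```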

  • Plot the correlation list for numerical attributes with SalePrice attribute
In [131]:
sale_num_col = num_col.insert(0,'SalePrice')
correlationMatrix = tempdata[sale_num_col].corr(method='spearman').abs()

sorted_corr = (correlationMatrix.loc[:, ['SalePrice']] # Select the SalePrice line in the correlation tab
    .sort_values(by='SalePrice', ascending=False).T) # Sort values by descending correlation coef with SalePrice


# Plot the heatmap
plt.figure(figsize=(27, 1))
ax = sns.heatmap(sorted_corr, 
                 # Annotations options
                 annot=True, annot_kws={'size':15, 'weight':'bold'}, fmt='.2f', 
                 # Display options
                 linewidths=1, cbar=False, cmap='Blues')

# Resize the labels
for label in ax.get_xticklabels()+ax.get_yticklabels():
    label.set_rotation(75)
    label.set_fontsize(15)
    
plt.title("The Correlations coefficients between SalePrice and the Numerical Attributes")
ax.title.set_fontsize(20)

plt.show()
The 20 numerical attributes most highly correlated with SalePrice are: TotalSF, TotalBath, GarageCars, AgeAtSell, GarageArea, GarageYrBlt, RemodAgeAtSell, TotRmsAbvGrd, Fireplaces, OpenPorchSF, LotArea, LotFrontage, MasVnrArea, WoodDeckSF, BedroomAbvGr, EnclosedPorch, KitchenAbvGr, ScreenPorch, MiscVal, LowQualFinSF.

4.2 Handling with Categorical data

  • Checking Categorical data
In [132]:
# Checking categorical data, including attributes that were previously considered numerical
cat_col = ["MSSubClass", "OverallCond", "OverallQual"]
dftotal[cat_col] = dftotal[cat_col].astype('object')
cat_col = dftotal.select_dtypes(include=['object']).columns
#cat_col = cat_col.insert(0,"MoSold", "YrSold", "MSSubClass", "OverallCond", "OverallQual")
print("Number of categorical variables:", len(cat_col),'\n')
print(cat_col)
Number of categorical variables: 46 

Index(['Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd',
       'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC',
       'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig',
       'LotShape', 'MSSubClass', 'MSZoning', 'MasVnrType', 'MiscFeature',
       'Neighborhood', 'OverallCond', 'OverallQual', 'PavedDrive', 'PoolQC',
       'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street',
       'Utilities'],
      dtype='object')

According to the Data Description file, the categorical data has two types:

  • Nominal categories

    • MSSubClass, MSZoning, Street, Alley, LandContour, LotConfig, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1, Exterior2, MasVnrType, MasVnrArea, Foundation, Heating, CentralAir, GarageType, MiscFeature, SaleType, SaleCondition
  • Ordinal Categories

    • LotShape, Utilities, LandSlope, OverallQual, OverallCond, ExterQual, ExterCond, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, HeatingQC, Electrical, KitchenQual, Functional, FireplaceQu, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence
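Because ordinal categories carry an order, they can alternatively be mapped to integers instead of one-hot encoded (we one-hot encode everything later; this is just a sketch of the alternative). The `qual_map` below follows the data description's Po < Fa < TA < Gd < Ex scale:

```python
import pandas as pd

# Quality levels ordered from worst to best, as in the data description
qual_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

toy = pd.DataFrame({'ExterQual': ['TA', 'Gd', 'Ex', 'TA', 'Fa']})
toy['ExterQual'] = toy['ExterQual'].map(qual_map)
print(toy['ExterQual'].tolist())  # [3, 4, 5, 3, 2]
```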
In [133]:
Nom_cat_col = ["MSSubClass", "MSZoning", "Street", "Alley", "LandContour", "LotConfig", "Neighborhood", 
               "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1", 
               "Exterior2", "MasVnrType", "MasVnrArea", "Foundation", "Heating", "CentralAir", "GarageType", 
               "MiscFeature", "SaleType", "SaleCondition"]
Ord_cat_col = ["LotShape", "Utilities", "LandSlope", "OverallQual", "OverallCond", "ExterQual", 
               "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", 
               "HeatingQC", "Electrical", "KitchenQual", "Functional", "FireplaceQu", "GarageFinish", 
               "GarageQual", "GarageCond", "PavedDrive", "PoolQC", "Fence"]
In [134]:
print("Summary of each categorical attribute")
dftotal[cat_col].describe().T.sort_values(by=['count'],ascending=True)
Summary of each categorical attribute
Out[134]:
count unique top freq
Alley 1460 3 None 1369
HouseStyle 1460 8 1Story 726
KitchenQual 1460 4 TA 735
LandContour 1460 4 Lvl 1311
LandSlope 1460 3 Gtl 1382
LotConfig 1460 5 Inside 1052
LotShape 1460 4 Reg 925
MSSubClass 1460 15 20 536
MSZoning 1460 5 RL 1151
MasVnrType 1460 4 None 872
MiscFeature 1460 5 None 1406
Neighborhood 1460 25 NAmes 225
OverallCond 1460 9 5 821
OverallQual 1460 10 5 397
PavedDrive 1460 3 Y 1340
PoolQC 1460 4 None 1453
RoofMatl 1460 8 CompShg 1434
RoofStyle 1460 6 Gable 1141
SaleCondition 1460 6 Normal 1198
SaleType 1460 9 WD 1267
HeatingQC 1460 5 Ex 741
Heating 1460 6 GasA 1428
GarageType 1460 7 Attchd 870
GarageQual 1460 6 TA 1311
BldgType 1460 5 1Fam 1220
BsmtCond 1460 5 TA 1311
BsmtExposure 1460 5 No 953
BsmtFinType1 1460 7 Unf 430
BsmtFinType2 1460 7 Unf 1256
BsmtQual 1460 5 TA 649
CentralAir 1460 2 Y 1365
Condition1 1460 9 Norm 1260
Condition2 1460 8 Norm 1445
Street 1460 2 Pave 1454
Electrical 1460 5 SBrkr 1335
ExterQual 1460 4 TA 906
Exterior1st 1460 15 VinylSd 515
Exterior2nd 1460 16 VinylSd 504
Fence 1460 5 None 1179
FireplaceQu 1460 6 None 690
Foundation 1460 6 PConc 647
Functional 1460 7 Typ 1360
GarageCond 1460 6 TA 1326
GarageFinish 1460 4 Unf 605
ExterCond 1460 5 TA 1282
Utilities 1460 2 AllPub 1459
  • Plot the distribution of each attributes
In [136]:
# Plot the distribution of each attribute
def chunks(l, n):
    return [l[i:i + n] for i in range(0, len(l), n)]

def histplot(df, cols, ncols):
    for lst in chunks(cols, ncols):
        fig, axes = plt.subplots(nrows=1, ncols=ncols, figsize=(10, 3), dpi=200)
        sns.set(font_scale = 0.5)
        for idx in range(0, len(lst)):
            attr = lst[idx]
            data = df[attr]
            sns.set_palette('Paired',30)
            g = sns.countplot(x=attr, data=df, ax=axes[idx])
            for item in g.get_xticklabels():
                item.set_rotation(75)
        plt.tight_layout()

histplot(dftotal, cat_col, 4)
Each attribute has a different distribution and its values are unequally distributed. For example, the value TA dominates the attribute GarageQual.
  • Scatter plots for these attributes listed above with SalePrice
In [38]:
# scatter plots
tempdata = pd.concat([dftotal[:len(dftrain)], dftrain['SalePrice']], axis=1) #get the train data - data frame 

temp = pd.melt(tempdata, id_vars=['SalePrice'],value_vars=cat_col)
sns.set(font_scale = 0.8)
grid = sns.FacetGrid(temp, col="variable",  col_wrap=4 , height=3.0, 
                     aspect=1.2,sharex=False, sharey=False).set_xticklabels(rotation=90)

grid.map(plt.scatter, "value",'SalePrice', s=3,color='navy')
plt.show()
  • Plot the boxplot for each categorical feature
In [39]:
# Plot the boxplot of each attribute compared with the SalePrice
def chunks(l, n):
    return [l[i:i + n] for i in range(0, len(l), n)]

def boxplot(df, cols, ncols):
    for lst in chunks(cols, ncols):
        fig, axes = plt.subplots(nrows=1, ncols=ncols, figsize=(10, 4), dpi=200)
        sns.set(font_scale = 0.7)
        for idx in range(0, len(lst)):
            attr = lst[idx]
            data = pd.concat([df['SalePrice'], df[attr]], axis=1)
            sns.set_palette('Paired',30)
            g = sns.boxplot(x=attr, y='SalePrice', data=data, ax=axes[idx], fliersize=0.5, linewidth=0.5)
            for item in g.get_xticklabels():
                item.set_rotation(75)
        plt.tight_layout()

boxplot(tempdata, cat_col, 3)
  • Categorical Correlation
    Pearson's R correlation is not defined for categorical data. The most common workaround is one-hot encoding, breaking each possible value of each categorical feature into 0-or-1 features; however, with 46 attributes this would lead to a huge and fragmented correlation matrix. Instead, we use the Uncertainty Coefficient (Theil's U), an asymmetric measure built for categorical attributes. Furthermore, we use the Correlation Ratio to build the correlation matrix between categorical and numerical attributes.

    Ref The Search for Categorical Correlation

In [40]:
# Define Functions
def convert(data, to):
    converted = None
    if to == 'array':
        if isinstance(data, np.ndarray):
            converted = data
        elif isinstance(data, pd.Series):
            converted = data.values
        elif isinstance(data, list):
            converted = np.array(data)
        elif isinstance(data, pd.DataFrame):
            converted = data.values  # DataFrame.as_matrix() is deprecated
    elif to == 'list':
        if isinstance(data, list):
            converted = data
        elif isinstance(data, pd.Series):
            converted = data.values.tolist()
        elif isinstance(data, np.ndarray):
            converted = data.tolist()
    elif to == 'dataframe':
        if isinstance(data, pd.DataFrame):
            converted = data
        elif isinstance(data, np.ndarray):
            converted = pd.DataFrame(data)
    else:
        raise ValueError("Unknown data conversion: {}".format(to))
    if converted is None:
        raise TypeError('cannot handle data conversion of type: {} to {}'.format(type(data),to))
    else:
        return converted

def cramers_v(x, y):
    """
    Calculates Cramer's V statistic for categorical-categorical association.
    Uses correction from Bergsma and Wicher, Journal of the Korean Statistical Society 42 (2013): 323-328.
    This is a symmetric coefficient: V(x,y) = V(y,x)
    Original function taken from: https://stackoverflow.com/a/46498792/5863503
    Wikipedia: https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
    :param x: list / NumPy ndarray / Pandas Series
        A sequence of categorical measurements
    :param y: list / NumPy ndarray / Pandas Series
        A sequence of categorical measurements
    :return: float
        in the range of [0,1]
    """
    confusion_matrix = pd.crosstab(x,y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))

def conditional_entropy(x, y):
    """
    Calculates the conditional entropy of x given y: S(x|y)
    Wikipedia: https://en.wikipedia.org/wiki/Conditional_entropy
    :param x: list / NumPy ndarray / Pandas Series
        A sequence of measurements
    :param y: list / NumPy ndarray / Pandas Series
        A sequence of measurements
    :return: float
    """
    # entropy of x given y
    y_counter = Counter(y)
    xy_counter = Counter(list(zip(x,y)))
    total_occurrences = sum(y_counter.values())
    entropy = 0.0
    for xy in xy_counter.keys():
        p_xy = xy_counter[xy] / total_occurrences
        p_y = y_counter[xy[1]] / total_occurrences
        entropy += p_xy * math.log(p_y/p_xy)
    return entropy

def theils_u(x, y):
    """
    Calculates Theil's U statistic (Uncertainty coefficient) for categorical-categorical association.
    This is the uncertainty of x given y: value is on the range of [0,1] - where 0 means y provides no information about
    x, and 1 means y provides full information about x.
    This is an asymmetric coefficient: U(x,y) != U(y,x)
    Wikipedia: https://en.wikipedia.org/wiki/Uncertainty_coefficient
    :param x: list / NumPy ndarray / Pandas Series
        A sequence of categorical measurements
    :param y: list / NumPy ndarray / Pandas Series
        A sequence of categorical measurements
    :return: float
        in the range of [0,1]
    """
    s_xy = conditional_entropy(x,y)
    x_counter = Counter(x)
    total_occurrences = sum(x_counter.values())
    p_x = list(map(lambda n: n/total_occurrences, x_counter.values()))
    s_x = ss.entropy(p_x)
    if s_x == 0:
        return 1
    else:
        return (s_x - s_xy) / s_x

def correlation_ratio(categories, measurements):
    """
    Calculates the Correlation Ratio (sometimes marked by the greek letter Eta) for categorical-continuous association.
    Answers the question - given a continuous value of a measurement, is it possible to know which category is it
    associated with?
    Value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means
    a category can be determined with absolute certainty.
    Wikipedia: https://en.wikipedia.org/wiki/Correlation_ratio
    :param categories: list / NumPy ndarray / Pandas Series
        A sequence of categorical measurements
    :param measurements: list / NumPy ndarray / Pandas Series
        A sequence of continuous measurements
    :return: float
        in the range of [0,1]
    """
    categories = convert(categories, 'array')
    measurements = convert(measurements, 'array')
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat)+1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0,cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array,n_array))/np.sum(n_array)
    numerator = np.sum(np.multiply(n_array,np.power(np.subtract(y_avg_array,y_total_avg),2)))
    denominator = np.sum(np.power(np.subtract(measurements,y_total_avg),2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = numerator/denominator
    return eta
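To make the asymmetry of Theil's U concrete, here is a self-contained toy run of the same formulas (the sequences `x` and `y` are made up: y fully determines x, but x does not determine y):

```python
import math
from collections import Counter

def conditional_entropy(x, y):
    """S(x|y), the entropy of x given y."""
    y_counter = Counter(y)
    xy_counter = Counter(zip(x, y))
    total = sum(y_counter.values())
    ent = 0.0
    for (xv, yv), n_xy in xy_counter.items():
        p_xy = n_xy / total
        p_y = y_counter[yv] / total
        ent += p_xy * math.log(p_y / p_xy)
    return ent

def theils_u(x, y):
    """U(x|y) = (S(x) - S(x|y)) / S(x), in [0, 1]."""
    s_xy = conditional_entropy(x, y)
    x_counter = Counter(x)
    total = sum(x_counter.values())
    s_x = -sum((n / total) * math.log(n / total) for n in x_counter.values())
    return 1.0 if s_x == 0 else (s_x - s_xy) / s_x

# each y value maps to exactly one x value, but x = 'a' is ambiguous about y
x = ['a', 'a', 'b', 'b']
y = ['p', 'q', 'r', 'r']
print(theils_u(x, y))  # 1.0 -> y fully determines x
print(theils_u(y, x))  # < 1 -> x does not fully determine y
```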
  • Plot the correlation matrix for categorical attributes
In [41]:
## Correlation matrix for categorical attributes
correlationMatrix = pd.DataFrame(index=cat_col, columns=cat_col)
for col in cat_col:
    for row in cat_col:
        temp = theils_u(dftotal[col], dftotal[row])
        correlationMatrix[row][col] = temp
        correlationMatrix[col][row] = temp
        
correlationMatrix = correlationMatrix.astype(float)
# Plot the correlation matrix
plt.figure(figsize=(30,30))
plt.title("Correlation matrix between categorical attributes", weight="semibold", fontsize=20)

# Build the Color Correlation Matrix
mask = np.zeros_like(correlationMatrix, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
g = sns.heatmap(correlationMatrix, cmap='Blues', fmt = '.2f', square = True, 
                mask=mask, annot=True, annot_kws={"size":10}, linewidths=1.0);

for text in g.texts:
    t = float(text.get_text())
    if ((t) < 0.5):
        text.set_text('')
    else:
        text.set_text(round(t, 4))

# Build the Values Correlation Matrix
mask[np.triu_indices_from(mask)] = False
mask[np.tril_indices_from(mask)] = True
g = sns.heatmap(correlationMatrix, cmap=ListedColormap(['white']), square = True, fmt = '.1f', 
                linewidths=1.0, mask=mask, annot=True, annot_kws={"size":10}, cbar=False);
g.set_xticklabels(g.get_xticklabels(), rotation=60, ha="right");

#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)
print("Display first 10 pairs of features with high correlation:")
correlationMatrix.where(np.triu(np.ones(correlationMatrix.shape), k=1)
                        .astype(np.bool)).stack().sort_values(ascending=False)[:10]
Display first 10 pairs of features with high correlation:
Out[41]:
Exterior1st    Exterior2nd    0.798545
GarageCond     GarageQual     0.654309
SaleCondition  SaleType       0.581416
HouseStyle     MSSubClass     0.573216
GarageFinish   GarageQual     0.552086
GarageType     Utilities      0.527188
Condition1     Condition2     0.463729
Neighborhood   Utilities      0.441903
SaleType       Utilities      0.426796
BsmtFinType2   Utilities      0.399015
dtype: float64
  • Plot the correlation list for categorical attributes with SalePrice attribute
In [42]:
# Pick out the top 20 attributes that correlated with SalePrice
sale_cat_col = cat_col.insert(0,'SalePrice')
sorted_corr = pd.DataFrame(index=['SalePrice'], columns=sale_cat_col)

for col in sale_cat_col:
    sorted_corr[col]['SalePrice'] = correlation_ratio(tempdata[col], tempdata['SalePrice'])

# # Sort values by descending correlation coef with SalePrice
sorted_corr = sorted_corr.T.sort_values(by='SalePrice',ascending=False).T

sorted_corr = sorted_corr.loc['SalePrice'][:20].to_frame().T.astype(float)

# Plot the heatmap
plt.figure(figsize=(20, 1))
ax = sns.heatmap(sorted_corr, 
                 # Annotations options
                 annot=True, annot_kws={'size':15, 'weight':'bold'}, fmt='.2f', 
                 # Display options
                 linewidths=1, cbar=False, cmap='Blues')

# Resize the labels
for label in ax.get_xticklabels()+ax.get_yticklabels():
    label.set_rotation(75)
    label.set_fontsize(15)
    
plt.title("The Correlations coefficients between SalePrice and the top 20 Categorical Attributes")
ax.title.set_fontsize(20)

plt.show()
The top 20 most significant attributes having high correlation with SalePrice are: OverallQual, Neighborhood, ExterQual, BsmtQual, KitchenQual, FireplaceQu, MSSubClass, Foundation, GarageType, BsmtFinType1, HeatingQC, MasVnrType, SaleType, Exterior1st, Exterior2nd, SaleCondition, BsmtExposure, OverallCond.
  • Transforming the Categorical Data
    We will transform the categorical data into dummy variables and drop one column from each encoded feature to avoid the dummy variable trap
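The trap itself is easy to see on a toy column: if all k dummy columns are kept, they always sum to 1, so they are perfectly collinear with an intercept. Dropping one level removes the redundancy (pandas' built-in `drop_first=True` drops the first level; our code below drops the last):

```python
import pandas as pd

toy = pd.DataFrame({'BldgType': ['1Fam', 'Twnhs', '1Fam', 'Duplex']})

full = pd.get_dummies(toy, columns=['BldgType'])
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1] -> perfect collinearity

reduced = pd.get_dummies(toy, columns=['BldgType'], drop_first=True)
print(list(reduced.columns))      # one level dropped, redundancy removed
```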
In [43]:
# create of list of dummy variables that I will drop, which will be the last
# column generated from each categorical feature
dummy_drop = []
for i in cat_col:
    dummy_drop += [ i+'_'+str(dftotal[i].unique()[-1]) ]
# create dummy variables
dftotal = pd.get_dummies(dftotal,columns=cat_col) 
# drop the last column generated from each categorical feature
dftotal = dftotal.drop(dummy_drop,axis=1)
print(format(dftotal.shape))
dftotal[:10]
(1460, 277)
Out[43]:
3SsnPorch BedroomAbvGr EnclosedPorch Fireplaces GarageArea GarageCars GarageYrBlt Id KitchenAbvGr LotArea ... SaleType_COD SaleType_CWD SaleType_Con SaleType_ConLD SaleType_ConLI SaleType_ConLw SaleType_New SaleType_WD Street_Pave Utilities_AllPub
0 0 3 0 0 548 2 2003.0 1 1 8450 ... 0 0 0 0 0 0 0 1 1 1
1 0 3 0 1 460 2 1976.0 2 1 9600 ... 0 0 0 0 0 0 0 1 1 1
2 0 3 0 1 608 2 2001.0 3 1 11250 ... 0 0 0 0 0 0 0 1 1 1
3 0 3 272 1 642 3 1998.0 4 1 9550 ... 0 0 0 0 0 0 0 1 1 1
4 0 4 0 1 836 3 2000.0 5 1 14260 ... 0 0 0 0 0 0 0 1 1 1
5 320 1 0 0 480 2 1993.0 6 1 14115 ... 0 0 0 0 0 0 0 1 1 1
6 0 3 0 1 636 2 2004.0 7 1 10084 ... 0 0 0 0 0 0 0 1 1 1
7 0 3 228 2 484 2 1973.0 8 1 10382 ... 0 0 0 0 0 0 0 1 1 1
8 0 2 205 2 468 2 1931.0 9 2 6120 ... 0 0 0 0 0 0 0 1 1 1
9 0 2 0 2 205 1 1939.0 10 2 7420 ... 0 0 0 0 0 0 0 1 1 1

10 rows × 277 columns


5. Feature Selection by XGBoost

Calculate feature importance using the XGBoost model (a gradient boosting algorithm)
In [44]:
X_train = dftotal[:len(dftrain)].drop(['Id'], axis=1)
y_train = np.log(dftrain['SalePrice'])
X_test = dftotal[len(dftrain):].drop(['Id'], axis=1)
X_train.shape, y_train.shape, X_test.shape 
Out[44]:
((1200, 276), (1200,), (260, 276))
Use RobustScaler, which is robust to outliers, to scale the numerical attributes
In [45]:
# fit the training set only, then transform both the training and test sets
scaler = RobustScaler()
X_train[num_col]= scaler.fit_transform(X_train[num_col])
X_test[num_col]= scaler.transform(X_test[num_col])
In [46]:
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
imp = pd.DataFrame(xgb.feature_importances_ ,columns = ['Importance'],index = X_train.columns)
imp = imp.sort_values(['Importance'], ascending = False)

print(imp)
                      Importance
ExterQual_TA            0.131174
GarageCars              0.124508
AgeAtSell               0.074292
TotalSF                 0.058756
Fireplaces              0.037856
TotalBath               0.031720
KitchenQual_Ex          0.030432
GarageQual_TA           0.027218
GarageType_Attchd       0.024438
BsmtQual_Ex             0.024009
MSZoning_RL             0.023246
CentralAir_Y            0.020851
RemodAgeAtSell          0.019601
BsmtExposure_Gd         0.017849
MSZoning_RM             0.017354
PavedDrive_Y            0.016340
KitchenAbvGr            0.012766
BldgType_1Fam           0.011886
GarageCond_TA           0.011537
HeatingQC_Ex            0.010324
KitchenQual_TA          0.009074
OverallQual_4           0.008655
ExterCond_Fa            0.008605
OverallCond_2           0.008533
BsmtFinType1_GLQ        0.008456
GarageArea              0.008385
Neighborhood_Crawfor    0.007924
MSZoning_C (all)        0.007683
OverallCond_3           0.006693
LotArea                 0.006313
...                          ...
HouseStyle_1Story       0.000000
HouseStyle_SFoyer       0.000000
LandContour_Low         0.000000
Functional_Min2         0.000000
Foundation_Slab         0.000000
Exterior1st_WdShing     0.000000
Exterior2nd_Stone       0.000000
Exterior2nd_AsbShng     0.000000
Exterior2nd_AsphShn     0.000000
Exterior2nd_Brk Cmn     0.000000
Exterior2nd_BrkFace     0.000000
Exterior2nd_CmentBd     0.000000
Exterior2nd_HdBoard     0.000000
Exterior2nd_ImStucc     0.000000
Exterior2nd_MetalSd     0.000000
Exterior2nd_Other       0.000000
Exterior2nd_Plywood     0.000000
Exterior2nd_Stucco      0.000000
Foundation_PConc        0.000000
Exterior2nd_VinylSd     0.000000
Exterior2nd_Wd Sdng     0.000000
Exterior2nd_Wd Shng     0.000000
Fence_GdPrv             0.000000
FireplaceQu_Ex          0.000000
FireplaceQu_Fa          0.000000
FireplaceQu_Gd          0.000000
FireplaceQu_None        0.000000
FireplaceQu_TA          0.000000
Foundation_CBlock       0.000000
Utilities_AllPub        0.000000

[276 rows x 1 columns]
In [47]:
print("Display the level of importance of the attributes:")
imp[:60].sort_values('Importance').plot(kind="barh",figsize=(15,25), color='navy')
plt.xticks(rotation=90)
plt.show()
Display the level of importance of the attributes:
As we can see, most of the important features computed by XGBoost are among the top 20 numerical and categorical attributes most highly correlated with SalePrice listed previously. This makes us more confident in the results we get from the XGBoost model.

Now we can use RFECV to eliminate the redundant features.

In [48]:
# Define a function to calculate RMSE
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true-y_pred)**2))

# Define a function to calculate negative RMSE (as a score)
def nrmse(y_true, y_pred):
    return -1.0*rmse(y_true, y_pred)

neg_rmse = make_scorer(nrmse)

estimator = XGBRegressor()
selector = RFECV(estimator, cv = 3, n_jobs = -1, scoring = neg_rmse)
selector = selector.fit(X_train, y_train)

print("The number of selected features is: {}".format(selector.n_features_))

features_kept = X_train.columns.values[selector.support_] 
X_train = X_train[features_kept]
X_test = X_test[features_kept]
The number of selected features is: 89
In [49]:
print("Display attributes kept:")
(features_kept)
Display attributes kept:
Out[49]:
array(['BedroomAbvGr', 'EnclosedPorch', 'Fireplaces', 'GarageArea',
       'GarageCars', 'GarageYrBlt', 'KitchenAbvGr', 'LotArea',
       'LotFrontage', 'OpenPorchSF', 'PoolArea', 'ScreenPorch',
       'TotRmsAbvGrd', 'WoodDeckSF', 'TotalSF', 'TotalBath', 'AgeAtSell',
       'RemodAgeAtSell', 'Alley_None', 'BldgType_1Fam', 'BsmtCond_Fa',
       'BsmtExposure_Gd', 'BsmtExposure_No', 'BsmtFinType1_ALQ',
       'BsmtFinType1_GLQ', 'BsmtFinType1_Unf', 'BsmtQual_Ex',
       'BsmtQual_Gd', 'CentralAir_Y', 'Condition1_Artery',
       'Condition1_Norm', 'ExterCond_Fa', 'ExterQual_TA',
       'Exterior1st_BrkComm', 'Exterior1st_BrkFace',
       'Exterior1st_MetalSd', 'Fence_GdWo', 'Fence_None',
       'Functional_Maj1', 'Functional_Maj2', 'Functional_Typ',
       'GarageCond_TA', 'GarageFinish_Fin', 'GarageQual_TA',
       'GarageType_Attchd', 'Heating_GasA', 'Heating_Grav',
       'HeatingQC_Ex', 'HouseStyle_1.5Fin', 'HouseStyle_2Story',
       'KitchenQual_Ex', 'KitchenQual_Gd', 'KitchenQual_TA',
       'LandSlope_Mod', 'LotConfig_Corner', 'LotShape_IR1',
       'LotShape_Reg', 'MSSubClass_30', 'MSSubClass_50',
       'MSZoning_C (all)', 'MSZoning_RL', 'MSZoning_RM',
       'Neighborhood_ClearCr', 'Neighborhood_Crawfor',
       'Neighborhood_Edwards', 'Neighborhood_MeadowV',
       'Neighborhood_NAmes', 'Neighborhood_Somerst',
       'Neighborhood_StoneBr', 'OverallCond_2', 'OverallCond_3',
       'OverallCond_4', 'OverallCond_5', 'OverallCond_7', 'OverallCond_8',
       'OverallCond_9', 'OverallQual_3', 'OverallQual_4', 'OverallQual_5',
       'OverallQual_7', 'OverallQual_8', 'OverallQual_9',
       'OverallQual_10', 'PavedDrive_N', 'PavedDrive_Y',
       'RoofMatl_CompShg', 'RoofMatl_WdShngl', 'SaleCondition_Abnorml',
       'SaleCondition_Normal'], dtype=object)
In [50]:
print("Display the shape of training set and test set")
X_train.shape, y_train.shape, X_test.shape 
Display the shape of training set and test set
Out[50]:
((1200, 89), (1200,), (260, 89))

Training Models and Evaluations

1. Select and Train a Model

We will select and train our models in four stages in order to find the best one:
  • Pick out 6 candidate models: Ridge(), Lasso(), SVR(), KernelRidge(), ElasticNet(), BayesianRidge()
  • Fit the models to get the initial Mean Test Score with default parameters
  • Use GridSearchCV to find the upper and lower limits of the parameters
  • Based on the parameters from GridSearch, use RandomizedSearchCV to find the best Mean Test Score
In [53]:
models = [Ridge(),
          Lasso(),
          SVR(),
          KernelRidge(),
          ElasticNet(),
          BayesianRidge()]
names = ['Ridge', 'Lasso', 'SVR', 'KernelRidge', 'ElasticNet','BayesianRidge']

# define cross validation strategy
def rmse_cv(model,X,y):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
dftemp = pd.DataFrame(columns=['Parameters','Mean Test Score'], index=names)

# Run model
for name, model in zip(names, models):
    score = rmse_cv(model, X_train, y_train)
    dftemp['Parameters'][name] = model
    dftemp['Mean Test Score'][name] = score.mean()
    #dftemp['Std Test Score'][name] = score.std()
dftemp.sort_values(by='Mean Test Score')
Out[53]:
Parameters Mean Test Score
BayesianRidge BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, co... 0.12311
Ridge Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr... 0.123379
SVR SVR(C=1.0, cache_size=200, coef0=0.0, degree=3... 0.184292
ElasticNet ElasticNet(alpha=1.0, copy_X=True, fit_interce... 0.397172
Lasso Lasso(alpha=1.0, copy_X=True, fit_intercept=Tr... 0.397641
KernelRidge KernelRidge(alpha=1, coef0=1, degree=3, gamma=... 0.648888
In [84]:
dftemp['Parameters']['KernelRidge']
Out[84]:
KernelRidge(alpha=1, coef0=1, degree=3, gamma=None, kernel='linear',
      kernel_params=None)

2. Fine-Tune Your Model

2.1 GridSearchCV

To find good combinations of hyperparameter values and fine-tune the models, we use Scikit-Learn's GridSearchCV
In [77]:
# Define a gridsearch method for hyperparameters tuning
class grid():
    def __init__(self,model):
        self.model = model
    
    def grid_get(self,X,y,param_grid,modelname,resultdf):
        grid_search = GridSearchCV(self.model,
                                   param_grid,
                                   cv=5, 
                                   scoring='neg_mean_squared_error')
        grid_search.fit(X,y)
        print('Best Parameter: ', grid_search.best_params_)
        print('Best RMSE: ', np.sqrt(-grid_search.best_score_))
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        resultdf.loc[modelname] = [grid_search.best_params_,np.sqrt(-grid_search.best_score_)]
griddf = pd.DataFrame(columns=['Parameters','Mean Test Score'])
In [78]:
# Lasso
param_grid = {'alpha': [0.0005, 0.005, 0.01],'warm_start': [True, False], 
              'selection': ['cyclic', 'random'],'fit_intercept': [True, False], 'max_iter':[10000]}
grid(Lasso()).grid_get(X_train,y_train,param_grid,'GridSearch_Lasso',griddf)
Best Parameter:  {'warm_start': True, 'selection': 'random', 'max_iter': 10000, 'fit_intercept': True, 'alpha': 0.0005}
Best RMSE:  0.1251896320267204
In [79]:
# Ridge
param_grid = {'alpha': [0.5,1.0,1.5], 
              'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
              'fit_intercept': [True, False]}
grid(Ridge()).grid_get(X_train,y_train,param_grid,'GridSearch_Ridge',griddf)
Best Parameter:  {'solver': 'svd', 'fit_intercept': True, 'alpha': 1.5}
Best RMSE:  0.12518115368229382
In [80]:
#SVR
param_grid = {'C':[3,6,9,12,15],'kernel':['rbf'],'gamma':[1/len(X_train)],'epsilon':[0.05,0.1,0.9]}
grid(SVR()).grid_get(X_train,y_train,param_grid,'GridSearch_SVR',griddf)
Best Parameter:  {'C': 15, 'gamma': 0.0008333333333333334, 'kernel': 'rbf', 'epsilon': 0.05}
Best RMSE:  0.15326343877518842
In [81]:
# Kernel Ridge 
param_grid={'alpha':[0.5,0.9], 'kernel':['polynomial'],'gamma':[0.1,0.5], 'degree':[3],'coef0':[0.8,1]}
grid(KernelRidge()).grid_get(X_train,y_train,param_grid,'GridSearch_KernelRidge',griddf)
Best Parameter:  {'gamma': 0.1, 'degree': 3, 'kernel': 'polynomial', 'coef0': 1, 'alpha': 0.5}
Best RMSE:  1.041344330948054
In [82]:
# ElasticNet 
param_grid = {'alpha': [0.5, 1.0, 1.5], 
              'l1_ratio': [0.3, 0.5, 0.9],
              'selection': ['cyclic', 'random'],
              'fit_intercept': [True, False],'max_iter':[10000]}
grid(ElasticNet()).grid_get(X_train,y_train,param_grid,'GridSearch_ElasticNet',griddf)
Best Parameter:  {'selection': 'random', 'max_iter': 10000, 'l1_ratio': 0.3, 'fit_intercept': True, 'alpha': 0.5}
Best RMSE:  0.30369316125888773
In [83]:
# BayesianRidge
param_grid = {'tol':[0.01,0.001,0.0009], 'alpha_1':[1e-05,1e-6, 1e-7], 
              'alpha_2':[1e-05,1e-6, 1e-7], 'lambda_1':[1e-05,1e-6, 1e-7], 
              'lambda_2':[1e-05,1e-6, 1e-7], 'n_iter':[100000]}
grid(BayesianRidge()).grid_get(X_train,y_train, param_grid,'GridSearch_BayesianRidge',griddf)
Best Parameter:  {'alpha_1': 1e-07, 'alpha_2': 1e-05, 'tol': 0.001, 'lambda_1': 1e-05, 'n_iter': 100000, 'lambda_2': 1e-05}
Best RMSE:  0.1251760650411168
In [84]:
griddf.sort_values(by='Mean Test Score')
Out[84]:
Parameters Mean Test Score
GridSearch_BayesianRidge {'alpha_1': 1e-07, 'alpha_2': 1e-05, 'tol': 0.... 0.125176
GridSearch_Ridge {'solver': 'svd', 'fit_intercept': True, 'alph... 0.125181
GridSearch_Lasso {'warm_start': True, 'selection': 'random', 'm... 0.125190
GridSearch_SVR {'C': 15, 'gamma': 0.0008333333333333334, 'ker... 0.153263
GridSearch_ElasticNet {'selection': 'random', 'max_iter': 10000, 'l1... 0.303693
GridSearch_KernelRidge {'gamma': 0.1, 'degree': 3, 'kernel': 'polynom... 1.041344

2.2 RandomizedSearchCV

In [69]:
# Define a RandomizedSearchCV wrapper for hyperparameter tuning
# (reuses the name `grid` so the call pattern matches the grid-search cells above)
class grid():
    def __init__(self, model):
        self.model = model

    def random_get(self, X, y, param_grid, modelname, resultdf):
        # Sample 1000 candidates from the given parameter distributions,
        # scored by 5-fold cross-validated negative MSE
        random_search = RandomizedSearchCV(self.model,
                                           param_grid,
                                           cv=5,
                                           scoring='neg_mean_squared_error',
                                           n_jobs=-1,
                                           n_iter=1000,
                                           random_state=0)
        random_search.fit(X, y)
        print('Best Parameter: ', random_search.best_params_)
        print('Best RMSE: ', np.sqrt(-random_search.best_score_))
        # Convert the negated MSE scores back to RMSE for reporting
        random_search.cv_results_['mean_test_score'] = np.sqrt(-random_search.cv_results_['mean_test_score'])
        resultdf.loc[modelname] = [random_search.best_params_, np.sqrt(-random_search.best_score_)]
randdf = pd.DataFrame(columns=['Parameters','Mean Test Score'])
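One detail worth keeping in mind when reading the parameter distributions below: SciPy's frozen distributions use the `(loc, scale)` convention, so `ss.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[loc, scale]`. A quick sketch making the sampled range explicit (the Lasso `alpha` distribution from the next cell is used as the example):

```python
import scipy.stats as ss

# uniform(loc, scale) draws from [loc, loc + scale],
# so uniform(0.0002, 0.0003) covers alphas in [0.0002, 0.0005]
alpha_dist = ss.uniform(0.0002, 0.0003)
samples = alpha_dist.rvs(size=10000, random_state=0)
print(samples.min(), samples.max())  # both within [0.0002, 0.0005]
```

This is why the best Lasso `alpha` reported below (≈0.000337) can exceed 0.0003: it is still inside the sampled interval.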
In [72]:
#Lasso 
param_grid = {'alpha': ss.uniform(0.0002,0.0003)}
grid(Lasso()).random_get(X_train,y_train,param_grid,'Randomized_Lasso',randdf)
Best Parameter:  {'alpha': 0.0003366332448350081}
Best RMSE:  0.1250133807524655
In [73]:
#Ridge 
param_grid = {'alpha':ss.uniform(0.9,1)}
grid(Ridge()).random_get(X_train,y_train,param_grid,'Randomized_Ridge',randdf)
Best Parameter:  {'alpha': 1.8998085781169654}
Best RMSE:  0.12503360834952465
In [74]:
#SVR 
param_grid = {'C':sp_randint(14,18),'kernel':['rbf'],'gamma':[1/len(X_train)],'epsilon':ss.uniform(0.008,0.009)}
grid(SVR()).random_get(X_train,y_train,param_grid,'Randomized_SVR',randdf)
Best Parameter:  {'C': 17, 'gamma': 0.0008333333333333334, 'kernel': 'rbf', 'epsilon': 0.016850905804111187}
Best RMSE:  0.15263442896444399
In [75]:
#KernelRidge 
param_grid = {'alpha': ss.uniform(0.05, 1.0), 'kernel': ['polynomial'], 'gamma':ss.uniform(0.1,0.5),
              'degree': [2], 'coef0':uniform(0.5, 3.5)}
grid(KernelRidge()).random_get(X_train,y_train,param_grid,'Randomized_KernelRidge',randdf)
Best Parameter:  {'gamma': 0.11580272141020279, 'degree': 2, 'kernel': 'polynomial', 'coef0': 3.4081807922215157, 'alpha': 0.9366832208474524}
Best RMSE:  0.23667976603360452
In [70]:
#ElasticNet 
param_grid = {'alpha':ss.uniform(0.3,0.5),'l1_ratio':ss.uniform(0.1,0.2)}
grid(ElasticNet()).random_get(X_train,y_train,param_grid,'Randomized_ElasticNet',randdf)
Best Parameter:  {'l1_ratio': 0.10185689010118543, 'alpha': 0.303101707354683}
Best RMSE:  0.19612511053785303
In [76]:
#Bayesian Ridge
param_grid = {'tol':[0.001], 'alpha_1':ss.gamma(1e-7, 1e-8), 
              'alpha_2':ss.gamma(1e-4, 1e-5), 'lambda_1':ss.gamma(1e-4, 1e-5), 
              'lambda_2':ss.gamma(1e-4, 1e-5)}
grid(BayesianRidge()).random_get(X_train,y_train, param_grid,'Randomized_BayesianRidge',randdf)
Best Parameter:  {'alpha_1': 1e-08, 'alpha_2': 1e-05, 'tol': 0.001, 'lambda_2': 1e-05, 'lambda_1': 0.7209173500996642}
Best RMSE:  0.12517258924655614
In [86]:
randdf
Out[86]:
Parameters Mean Test Score
Randomized_ElasticNet {'l1_ratio': 0.10185689010118543, 'alpha': 0.3... 0.196125
Randomized_Lasso {'alpha': 0.0003366332448350081} 0.125013
Randomized_Ridge {'alpha': 1.8998085781169654} 0.125034
Randomized_SVR {'C': 17, 'gamma': 0.0008333333333333334, 'ker... 0.152634
Randomized_KernelRidge {'gamma': 0.11580272141020279, 'degree': 2, 'k... 0.236680
Randomized_BayesianRidge {'alpha_1': 1e-08, 'alpha_2': 1e-05, 'tol': 0.... 0.125173
In [87]:
ranking = pd.concat([griddf, randdf])
ranking
Out[87]:
Parameters Mean Test Score
GridSearch_Lasso {'warm_start': True, 'selection': 'random', 'm... 0.125190
GridSearch_Ridge {'solver': 'svd', 'fit_intercept': True, 'alph... 0.125181
GridSearch_SVR {'C': 15, 'gamma': 0.0008333333333333334, 'ker... 0.153263
GridSearch_KernelRidge {'gamma': 0.1, 'degree': 3, 'kernel': 'polynom... 1.041344
GridSearch_ElasticNet {'selection': 'random', 'max_iter': 10000, 'l1... 0.303693
GridSearch_BayesianRidge {'alpha_1': 1e-07, 'alpha_2': 1e-05, 'tol': 0.... 0.125176
Randomized_ElasticNet {'l1_ratio': 0.10185689010118543, 'alpha': 0.3... 0.196125
Randomized_Lasso {'alpha': 0.0003366332448350081} 0.125013
Randomized_Ridge {'alpha': 1.8998085781169654} 0.125034
Randomized_SVR {'C': 17, 'gamma': 0.0008333333333333334, 'ker... 0.152634
Randomized_KernelRidge {'gamma': 0.11580272141020279, 'degree': 2, 'k... 0.236680
Randomized_BayesianRidge {'alpha_1': 1e-08, 'alpha_2': 1e-05, 'tol': 0.... 0.125173
In [88]:
ranking = ranking.sort_values(by='Mean Test Score')
ranking
Out[88]:
Parameters Mean Test Score
Randomized_Lasso {'alpha': 0.0003366332448350081} 0.125013
Randomized_Ridge {'alpha': 1.8998085781169654} 0.125034
Randomized_BayesianRidge {'alpha_1': 1e-08, 'alpha_2': 1e-05, 'tol': 0.... 0.125173
GridSearch_BayesianRidge {'alpha_1': 1e-07, 'alpha_2': 1e-05, 'tol': 0.... 0.125176
GridSearch_Ridge {'solver': 'svd', 'fit_intercept': True, 'alph... 0.125181
GridSearch_Lasso {'warm_start': True, 'selection': 'random', 'm... 0.125190
Randomized_SVR {'C': 17, 'gamma': 0.0008333333333333334, 'ker... 0.152634
GridSearch_SVR {'C': 15, 'gamma': 0.0008333333333333334, 'ker... 0.153263
Randomized_ElasticNet {'l1_ratio': 0.10185689010118543, 'alpha': 0.3... 0.196125
Randomized_KernelRidge {'gamma': 0.11580272141020279, 'degree': 2, 'k... 0.236680
GridSearch_ElasticNet {'selection': 'random', 'max_iter': 10000, 'l1... 0.303693
GridSearch_KernelRidge {'gamma': 0.1, 'degree': 3, 'kernel': 'polynom... 1.041344
We pick the best model: the Lasso from the randomized search, with alpha = 0.0003366332448350081, since it achieves the lowest cross-validated RMSE (0.125013).
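The winner can also be selected programmatically instead of read off the sorted table. A minimal sketch, using a toy DataFrame that mirrors the structure of the `ranking` frame above:

```python
import pandas as pd

# Toy ranking frame with the same columns as the real one
ranking = pd.DataFrame(
    {'Parameters': [{'alpha': 0.000337}, {'alpha': 1.8998}],
     'Mean Test Score': [0.125013, 0.125034]},
    index=['Randomized_Lasso', 'Randomized_Ridge'])

best_name = ranking['Mean Test Score'].idxmin()    # lowest CV RMSE wins
best_params = ranking.loc[best_name, 'Parameters']
print(best_name, best_params)
```

`idxmin()` returns the index label of the smallest score, which avoids manual inspection if more models are added later.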
In [95]:
model = Lasso(alpha= 0.0003366332448350081)
model.fit(X_train,y_train)
Out[95]:
Lasso(alpha=0.0003366332448350081, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
In [97]:
# Make predictions on the test set; np.exp reverses the log transform
# that was applied to SalePrice during preprocessing
y_pred = np.exp(model.predict(X_test))
output = pd.DataFrame({'Id': dftest['Id'], 'SalePrice': y_pred})
output.to_csv('prediction.csv', index=False)